Non-Autoregressive Coarse-to-Fine Video Captioning

Abstract

It is encouraging to see that progress has been made in bridging videos and natural language. However, mainstream video captioning methods suffer from slow inference speed due to the sequential manner of autoregressive decoding, and prefer generating generic descriptions due to the insufficient training of visual words (e.g., nouns and verbs) and an inadequate decoding paradigm. In this paper, we propose a non-autoregressive decoding based model with a coarse-to-fine captioning procedure to alleviate these defects. In implementations, we employ a bi-directional self-attention based network as our language model for achieving inference speedup, based on which we decompose the captioning procedure into two stages, where the model has different focuses. Specifically, given that visual words determine the semantic correctness of captions, we design a mechanism of generating visual words to not only promote the training of scene-related words but also capture relevant details from videos to construct a coarse-grained sentence "template". Thereafter, we devise dedicated decoding algorithms that fill in the "template" with suitable words and modify inappropriate phrasing via iterative refinement to obtain a fine-grained description. Extensive experiments on two mainstream video captioning benchmarks, i.e., MSVD and MSR-VTT, demonstrate that our approach achieves state-of-the-art performance, generates diverse descriptions, and obtains high inference efficiency.
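
The two-stage idea in the abstract (build a coarse "template" of visual words, then fill and refine it in parallel) can be illustrated with a small sketch. This is not the authors' decoding algorithm: it is a generic mask-and-repredict loop, and every name in it (`coarse_to_fine_decode`, `toy_predict`, `TARGET`, the confidence values, the linear re-masking schedule) is an assumption made for illustration. The toy predictor stands in for a bi-directional language model and simply guesses correctly whenever a neighbouring word is visible.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"

# A predictor maps a partially masked sentence to one (token, confidence)
# guess per position; all positions are predicted in one parallel pass.
Predictor = Callable[[List[str]], List[Tuple[str, float]]]


def coarse_to_fine_decode(template: List[str], predict: Predictor,
                          iterations: int = 3) -> List[str]:
    """Fill a coarse template, then re-mask and re-predict the weakest tokens."""
    tokens = list(template)
    length = len(tokens)
    # Template (visual) words start with full confidence, masks with none.
    conf = [0.0 if t == MASK else 1.0 for t in tokens]
    for step in range(iterations):
        guesses = predict(tokens)
        for i, (tok, c) in enumerate(guesses):
            if tokens[i] == MASK:  # only masked slots are overwritten
                tokens[i], conf[i] = tok, c
        # Assumed schedule: re-mask fewer tokens each pass, none on the last.
        n_mask = length * (iterations - 1 - step) // iterations
        weakest = sorted(range(length), key=lambda i: conf[i])[:n_mask]
        for i in weakest:
            tokens[i] = MASK
    return tokens


# Hypothetical stand-in for a trained model (illustration only).
TARGET = ["a", "man", "is", "playing", "a", "guitar"]


def toy_predict(tokens: List[str]) -> List[Tuple[str, float]]:
    out = []
    for i, word in enumerate(TARGET):
        left = i > 0 and tokens[i - 1] != MASK
        right = i + 1 < len(tokens) and tokens[i + 1] != MASK
        # With no visible neighbour the toy model emits a generic word.
        out.append((word, 0.9) if left or right else ("something", 0.3))
    return out


# Coarse template: only the visual words "man" and "guitar" are known.
template = [MASK, "man", MASK, MASK, MASK, "guitar"]
one_pass = coarse_to_fine_decode(template, toy_predict, iterations=1)
refined = coarse_to_fine_decode(template, toy_predict, iterations=2)
print(" ".join(one_pass))
print(" ".join(refined))
```

In this toy setting a single parallel pass leaves the generic word "something" in the slot that had no visible neighbour, while a second pass re-masks the low-confidence slots and recovers the full sentence. The number of passes is fixed in advance, which is where the speedup over word-by-word autoregressive decoding would come from.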

Similar resources

Stack-Captioning: Coarse-to-Fine Learning for Image Captioning

Existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich fine-grained descriptions. On the other hand, a multi-stage image captioning model is hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multistage prediction framework for image captioning, composed of multiple decoders each of which...

Fast coarse-to-fine video retrieval via shot-level statistics

We propose a fast coarse-to-fine video retrieval scheme using shot-level spatio-temporal statistics. The proposed scheme consists of a two-step coarse search and a fine search. At the coarse-search stage, the shot-level motion and color distributions are computed as the spatio-temporal features for shot matching. The first-pass coarse search uses the shot-level global statistics to cut down the ...

A Coarse-to-Fine Human Body Segmentation Method in Video

Precise human body segmentation is difficult because of inter-occlusion when there are multiple human bodies in a video. A coarse-to-fine segmentation method is proposed. In coarse segmentation, human shape models are used to obtain each human's position and coarse region. The human models, with varying scale and posture, are constructed from head, torso, and legs. For each human body, its corresponding hu...

Coarse-to-Fine Visual

We study visual selection: detect and roughly localize all instances of a generic object class, such as a face, in a greyscale scene, measuring performance in terms of computation and false alarms. Our approach is sequential testing which is coarse-to-fine in both the exploration of poses and the representation of objects. All the tests are binary and indicate the presence or absence of loo...

Sequence to Sequence Model for Video Captioning

Automatically generating video captions with natural language remains a challenge for both the fields of natural language processing and computer vision. Recurrent Neural Networks (RNNs), which model sequence dynamics, have proved to be effective in visual interpretation. Based on a recent sequence to sequence model for video captioning, which is designed to learn the temporal structure of the se...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i4.16421